This study analyzes the primary factors influencing rental and buying property prices in Madrid’s housing market, leveraging a dataset from Kaggle. The analysis involved extensive data cleaning to address missing values, standardize variables, and handle outliers, ensuring robust results. Exploratory data analysis revealed trends based on property type, size, number of rooms, energy certification, year of construction, location, and amenities like lifts and parking. Larger properties in central and northern neighborhoods command higher prices, while flats dominate the market due to their affordability and versatility. Luxury properties, such as penthouses, reflect a niche demand for exclusivity. Predictive modeling was conducted using Lasso Regression, Linear Regression, and Random Forest to forecast log-transformed rental and buying prices. Gaussian Mixture Models (GMM) were applied to classify rental prices into “Inexpensive” and “Expensive” categories, improving the prediction accuracy. Random Forest emerged as the best-performing model, achieving the lowest RMSE (0.1783596 for buying prices, 0.1300083 for rental prices) and the highest R² (0.9318762 and 0.9160775, respectively), demonstrating its ability to capture complex relationships in the data. Limitations included the dataset’s static nature and assumptions made for handling missing values. The results provide valuable insights for stakeholders, including investors and policymakers, and highlight the importance of location, property size, and amenities in determining housing prices. These findings establish a foundation for further research and predictive modeling in urban real estate markets.
1 Introduction
1.1 Project Goals
The primary goal of this project is to analyze the key factors influencing rental and buying property prices in Madrid, a city with a dynamic and competitive real estate landscape driven by rising urbanization and increasing demand. By conducting an exploratory data analysis, we aim to uncover insights into the characteristics and distributions of the data, focusing on factors such as property type, size, neighborhood, energy certification, and features like parking availability or garden access. This analysis also includes visualizing geographical trends across districts in Madrid to highlight patterns in house prices. To enhance the depth of the study, predictive modeling is employed to estimate property prices based on location and other attributes, allowing us to compare the performance of different models and determine the most effective approach for accurate predictions. This comprehensive analysis not only provides valuable insights for investors, policymakers, and residents but also serves as a reusable framework for analyzing real estate markets in other cities by adapting the code to different contexts.
1.2 Related Work
Our project builds on previous studies that have applied machine learning and statistical methods to analyse housing markets. Kenyon, Arribas-Bel, and Robinson (2024) used clustering techniques to analyze spatial patterns in Madrid, emphasizing the role of location-based features in understanding housing sub-markets. Baldominos et al. (2018) applied predictive models to identify underpriced properties in Madrid, demonstrating how machine learning can generate actionable insights for real estate markets. Milcheva and Zhu (2020) reviewed various machine learning methods for house price prediction, highlighting the effectiveness of advanced models, such as gradient boosting and random forests, in capturing complex relationships. These works informed both our exploratory data analysis and modeling approach. Papers like Jafari and Akhavian (2019) often focus primarily on the most obvious factors influencing housing, such as size, location, and age. Similarly, this analysis addresses these factors in research question 1 but goes further by incorporating less obvious or seemingly irrelevant factors that might still hold predictive value. Through research question 3, this study broadens the scope by analyzing additional factors, like whether the exact address is hidden on online rental platforms, to enhance the predictive power in the housing market. Furthermore, some authors tend to focus exclusively on either the purchasing or rental market. This analysis bridges that gap by examining both markets, enabling meaningful comparisons between them.
1.3 Research Questions
Question 1: What are the primary factors that influence real estate prices in Madrid? This question analyses the correlation of the renting and/or buying price with other variables, such as house type, size and other characteristics of the house and property.
Question 2: How does the location within Madrid affect real estate prices? Are there specific neighbourhoods appreciating or depreciating the prices? By using the neighbourhood variable, we aim to map the properties to specific areas within Madrid and to identify price trends within those areas.
Question 3: What impact do secondary features of the house and property have on real estate prices? In comparison to the first question which focuses on the main features of the house – such as the type, size and number of rooms – this question focuses on additional characteristics – such as the type of energy certificate, availability of parking spaces or access to a garden – and their correlation with the real estate price.
2 Data
2.1 Sources
The dataset was sourced from Kaggle, a popular platform for data science competitions and sharing datasets. The specific dataset used is titled “Madrid Real Estate Market,” curated by Mirbek Toktogaraev. It contains detailed information about the real estate market in Madrid, including property characteristics, pricing, and location data. The dataset consists of listings from popular real estate portals of Madrid.
2.2 Description
The dataset consists of 21,742 rows and 58 columns, with a mix of numeric and categorical variables. The dataset is not regularly updated, meaning the analysis reflects conditions at the time of the data’s collection.
The key features can be grouped into several categories:
▫ sq_mt_built: Total built area (in square meters) of the property.
▫ n_rooms: Number of rooms in the property.
▫ n_bathrooms: Number of bathrooms in the property.
▫ floor: The floor number of the property.
▫ house_type_id: Type of the property (e.g., apartment, house).
▫ built_year: The year in which the property was constructed.
▫ neighborhood_id: The neighborhood in which the property is located.
▫ is_exact_address_hidden: A boolean indicating if the exact address is available or not.
Note: The columns for latitude and longitude (geographical coordinates) are present but completely null, so they cannot be used for geospatial analysis without additional data sources.
▫ rent_price: The rental price of the property (in euros).
▫ buy_price: The purchase price of the property (in euros).
▫ buy_price_by_area: Price per square meter for properties listed for sale (in euros/m²).
▫ has_parking: A boolean indicating whether the property has parking facilities.
▫ energy_certificate: The energy efficiency rating of the property.
2.3 Wrangling/Cleaning
To prepare the dataset for analysis, we implemented a comprehensive cleaning process to ensure data consistency and usability. This involved removing redundant and irrelevant columns, handling missing and erroneous values, standardizing categorical and numerical variables, and addressing outliers. Key steps included dropping columns with minimal analytical value, filling binary feature columns with logical FALSE values for missing data, converting text-based floor levels to numeric, and standardizing neighborhood names and house types. Additionally, we addressed outliers through visual inspection and corrected implausible values, such as an erroneous built_year.
More precisely, the steps included:
Removing Redundant and Irrelevant Columns:
Dropped several columns not useful for analysis, including title, subtitle, raw_address, operation (contained only “sale”), and others with minimal impact on the analysis.
Handling Missing Values:
Columns with all NA values were removed.
Filtered out rows with missing values in critical columns, such as built_year and floor.
Replaced NA values with FALSE in specific binary variables, including has_ac, has_garden, has_pool, has_terrace, has_balcony, has_storage_room, and is_accessible.
This decision was based on the observation that only TRUE values were present in these columns, likely indicating that the platform allowed users to specify features they had (by selecting TRUE) but not features they lacked (by selecting FALSE). Therefore, any NA values were logically assumed to mean the absence of the feature.
Handling negative values in rent_price
Columns with all negative values were removed as negative rent_prices do not make sense.
Remove specific columns that are redundant or not useful.
Remove observations where built_year or floor is NA.
Fill in the NA values for certain columns with FALSE, as mentioned in point 2.
Floor Level Conversion:
Cleaned the floor variable by converting text-based descriptions to numeric values:
Converted “Bajo” (ground floor) to 0, “Sótano” (basement) to -1, “Semi-sótano” (semi-basement) to -0.5, and “Entreplanta” (mezzanine) to 0.5.
Converted the floor column to numeric for analysis.
Modify values in house_type_id:
Converted columns neighborhood_id, house_type_id, and energy_certificate to factors.
Standardized house_type_id values as follows:
HouseType 1: Pisos renamed to Flat
HouseType 4: Dúplex renamed to Duplex
HouseType 5: Áticos renamed to Penthouse
HouseType 2: Casa o chalet (not present in final cleaned data)
Modify values in energy_certificate:
Modified values in energy_certificate to standardize the categories:
Added categories for energy ratings: A, B, C, D, E, F, G
Renamed “en trámite” to “in process”
Renamed “no indicado” to “not indicated”
Renamed “inmueble exento” to “exempt”
Verify which rows still have NA values after the above cleaning steps and remove those.
Cleaning Neighborhood Information:
Standardized neighborhood names in neighborhood_id.
Cleaning Neighborhood Information:
Removing numerical identifiers and irrelevant text to make data handling easier.
Outlier Detection and Adjustment:
Created boxplots to visually inspect outliers in numeric variables, leading to corrections like changing a built_year value from 8170 to 1870.
Removed observations with negative values in rent_price, as these were likely erroneous.
Remove has_garden column:
After all the cleaning steps, we realized that there are only FALSE values in the has_garden column and all the TRUE values have been deleted due to other cleaning steps. This is why we decided to remove the has_garden column.
Code
# Load dataset and remove the first row because it contains a headerhouses_Madrid <-suppressMessages(read_excel(here::here("data/processed/houses_Madrid.xlsx"), skip =1))data <- houses_Madrid# Convert specific columns to factors (categorical variables)data <- data %>%mutate(neighborhood_id =as.factor(neighborhood_id),house_type_id =as.factor(house_type_id),energy_certificate =as.factor(energy_certificate) )# A. Introducing a cleaned dataset with the following changes# 1. Remove the first column because it is redundantdata_cleaned <- data[, -1]# 2. Remove columns that have all NA values except the headerdata_cleaned <- data_cleaned[, colSums(!is.na(data_cleaned)) >0]# 3. Remove observations with a negative rent_pricedata_cleaned <- data_cleaned[data_cleaned$rent_price >=0, ]# 4. Remove specific columns that are redundant or not usefuldata_cleaned <- data_cleaned %>%select(-title, # Covered by neighborhood_id-subtitle, # Covered by neighborhood_id-raw_address, # Covered by neighborhood_id-street_name, # Covered by neighborhood_id-street_number, # Covered by neighborhood_id-n_floors, # Less important than floors data-is_floor_under, # Redundant with floors data-sq_mt_useful, # Less important than sq_mt_built-sq_mt_allotment, # Less important than sq_mt_built-is_rent_price_known, # All values are FALSE-is_buy_price_known, # All values are TRUE-operation, # All values are "sale"-buy_price_by_area, # rent_price_by_area is not known, removal to facilitate comparison-is_new_development, # Redundant with built_year-has_central_heating, # Less important-has_individual_heating, # Less important-has_fitted_wardrobes, # Less important-is_exterior, # Less important-has_green_zones, # Less important-is_parking_included_in_price, # Less important-parking_price, # Less important-is_orientation_north, # Less important-is_orientation_west, # Less important-is_orientation_south, # Less important-is_orientation_east # Less important )# 5. Remove observations where 'built_year' or 'floor' is NAdata_cleaned <- data_cleaned %>%filter(!is.na(built_year), !is.na(floor))# 6. Fill in NA values for certain columns with "FALSE"data_cleaned <- data_cleaned %>%mutate(has_ac =ifelse(is.na(has_ac), FALSE, has_ac),has_garden =ifelse(is.na(has_garden), FALSE, has_garden),has_pool =ifelse(is.na(has_pool), FALSE, has_pool),has_terrace =ifelse(is.na(has_terrace), FALSE, has_terrace),has_balcony =ifelse(is.na(has_balcony), FALSE, has_balcony),has_storage_room =ifelse(is.na(has_storage_room), FALSE, has_storage_room),is_accessible =ifelse(is.na(is_accessible), FALSE, is_accessible) )# 7. In the column floor:# 7.1 "bajo" is used to say groundfloor, we are changing that to simply zero# 7.2 "Semi-sótano", "Semi-sótano exterior", "Semi-sótano interior" is used to say semi-basement, we are changing that to -0.5# 7.3 "Sótano", "Sótano exterior", "Sótano interior" is used to say basement, we are changing that to -1# 7.4 "Entreplanta", "Entreplanta exterior", "Entreplanta interior" is used to say mezzanine, we are changing that to 0.5data_cleaned <- data_cleaned %>%mutate(floor =case_when( floor =="Bajo"~"0", floor %in%c("Semi-sótano", "Semi-sótano exterior", "Semi-sótano interior") ~"-0.5", floor %in%c("Sótano", "Sótano exterior", "Sótano interior") ~"-1", floor %in%c("Entreplanta", "Entreplanta exterior", "Entreplanta interior") ~"0.5",TRUE~ floor ) ) %>%mutate(floor =as.numeric(floor)) # Convert to numeric for analysis# 8. Rename values in house_type_id columndata_cleaned <- data_cleaned %>%mutate(house_type_id =trimws(house_type_id), # Remove any leading/trailing whitespacehouse_type_id =case_when( house_type_id =="HouseType 1: Pisos"~"Flat", house_type_id =="HouseType 2: Casa o chalet"~"House", house_type_id =="HouseType 4: Dúplex"~"Duplex", house_type_id =="HouseType 5: Áticos"~"Penthouse",TRUE~ house_type_id # Keep other values unchanged ) ) %>%filter(!is.na(house_type_id)) # Remove rows with NA in 'house_type_id'# 9. Modify values in the 'energy_certificate' columndata_cleaned <- data_cleaned %>%mutate(energy_certificate =case_when( energy_certificate =="en trámite"~"in process", energy_certificate =="no indicado"~"not indicated", energy_certificate =="inmueble exento"~"exempt",TRUE~ energy_certificate # Keep original values for other cases ))# 10. Check which rows still have NA values and remove thosedata_cleaned <- data_cleaned %>%filter(!is.na(sq_mt_built), !is.na(n_bathrooms), !is.na(has_lift))# 11. Clean the neighborhood_id column to keep only district namesdata_cleaned <- data_cleaned %>%mutate(# Extract only the district name from neighborhood_id by removing everything before "District X: "neighborhood_id =str_replace(neighborhood_id, ".*District \\d+: ", ""),# 12. Remove any leading or trailing whitespace in district namesneighborhood_id =trimws(neighborhood_id) )# 13. Handling outliers# 13.1 Identify outliers using boxplotsnumeric_columns <-sapply(data_cleaned, is.numeric) # Identify numeric columnsnumeric_data <- data_cleaned[, numeric_columns] # Subset numeric data# 13.2 Create boxplots for each numeric variable and display them two by twoplot_list <-list() # List to store plotsfor (column incolnames(numeric_data)) { plot <-ggplot(data_cleaned, aes_string(x ="factor(1)", y = column)) +geom_boxplot() +labs(title =paste("Boxplot of", column), y = column) +theme_minimal() +theme(axis.title.x =element_blank()) plot_list <-append(plot_list, list(plot))}
Initial boxplot visualizations were used to identify potential outliers and anomalies in the data (as mentioned in point 13). Additional visualizations and in-depth analyses will be conducted as part of the Exploratory Data Analysis (EDA) chapter.
Code
# Display the boxplots two by twofor (i inseq(1, length(plot_list), by =2)) {if (i +1<=length(plot_list)) {print(plot_list[[i]] + plot_list[[i +1]]) # Print two plots side by side } else {print(plot_list[[i]]) # If odd number of plots, print the last one by itself }}
Code
# 13.3 Check the data for anomalies and replace the incorrect value of 8170 with 1870data_cleaned$built_year[data_cleaned$built_year ==8170] <-1870# 14. Removing the has_garden column from the data_cleaned datasetdata_cleaned <- data_cleaned %>%select(-has_garden)
...1 id title subtitle
Min. : 0 Min. : 1 Length:21742 Length:21742
1st Qu.: 5435 1st Qu.: 5436 Class :character Class :character
Median :10870 Median :10872 Mode :character Mode :character
Mean :10870 Mean :10872
3rd Qu.:16306 3rd Qu.:16307
Max. :21741 Max. :21742
sq_mt_built sq_mt_useful n_rooms n_bathrooms
Min. : 13.0 Min. : 1.0 Min. : 0.000 Min. : 1.000
1st Qu.: 70.0 1st Qu.: 59.0 1st Qu.: 2.000 1st Qu.: 1.000
Median :100.0 Median : 79.0 Median : 3.000 Median : 2.000
Mean :146.9 Mean :103.5 Mean : 3.006 Mean : 2.092
3rd Qu.:162.0 3rd Qu.:113.0 3rd Qu.: 4.000 3rd Qu.: 2.000
Max. :999.0 Max. :998.0 Max. :24.000 Max. :16.000
NA's :126 NA's :13514 NA's :16
n_floors sq_mt_allotment latitude longitude
Min. :1.000 Min. : 1.0 Mode:logical Mode:logical
1st Qu.:2.000 1st Qu.: 2.0 NA's:21742 NA's:21742
Median :3.000 Median :232.0
Mean :3.129 Mean :241.7
3rd Qu.:4.000 3rd Qu.:354.0
Max. :7.000 Max. :997.0
NA's :20305 NA's :20310
raw_address is_exact_address_hidden street_name
Length:21742 Mode :logical Length:21742
Class :character FALSE:6740 Class :character
Mode :character TRUE :15002 Mode :character
street_number portal floor is_floor_under
Length:21742 Mode:logical Length:21742 Mode :logical
Class :character NA's:21742 Class :character FALSE:18033
Mode :character Mode :character TRUE :2539
NA's :1170
door
Mode:logical
NA's:21742
neighborhood_id
Neighborhood 23: Malasaña-Universidad (5196.25 €/m2) - District 4: Centro : 485
Neighborhood 59: Conde Orgaz-Piovera (4275.46 €/m2) - District 9: Hortaleza: 471
Neighborhood 28: El Viso (6255.45 €/m2) - District 5: Chamartín : 467
Neighborhood 32: Almagro (6564.27 €/m2) - District 6: Chamberí : 423
Neighborhood 72: Aravaca (3600.4 €/m2) - District 11: Moncloa : 419
Neighborhood 22: Lavapiés-Embajadores (4448.3 €/m2) - District 4: Centro : 385
(Other) :19092
operation rent_price rent_price_by_area is_rent_price_known
Length:21742 Min. :-34590276 Mode:logical Mode :logical
Class :character 1st Qu.: 725 NA's:21742 FALSE:21742
Mode :character Median : 1116
Mean : -59170
3rd Qu.: 1687
Max. : 2517
buy_price buy_price_by_area is_buy_price_known
Min. : 36000 Min. : 447 Mode:logical
1st Qu.: 198000 1st Qu.: 2551 TRUE:21742
Median : 375000 Median : 3720
Mean : 653736 Mean : 4021
3rd Qu.: 763600 3rd Qu.: 5000
Max. :8800000 Max. :18889
house_type_id is_renewal_needed is_new_development
HouseType 1: Pisos :17705 Mode :logical Mode :logical
HouseType 2: Casa o chalet: 1938 FALSE:17747 FALSE:19281
HouseType 4: Dúplex : 676 TRUE :3995 TRUE :1469
HouseType 5: Áticos : 1032 NA's :992
NA's : 391
built_year has_central_heating has_individual_heating are_pets_allowed
Min. :1723 Mode :logical Mode :logical Mode:logical
1st Qu.:1957 FALSE:9516 FALSE:4092 NA's:21742
Median :1970 TRUE :4092 TRUE :9516
Mean :1970 NA's :8134 NA's :8134
3rd Qu.:1994
Max. :8170
NA's :11742
has_ac has_fitted_wardrobes has_lift is_exterior
Mode:logical Mode:logical Mode :logical Mode :logical
TRUE:11211 TRUE:13399 FALSE:4461 FALSE:1777
NA's:10531 NA's:8343 TRUE :14895 TRUE :16922
NA's :2386 NA's :3043
has_garden has_pool has_terrace has_balcony has_storage_room
Mode:logical Mode:logical Mode:logical Mode:logical Mode:logical
TRUE:1556 TRUE:5171 TRUE:9548 TRUE:3321 TRUE:7698
NA's:20186 NA's:16571 NA's:12194 NA's:18421 NA's:14044
is_furnished is_kitchen_equipped is_accessible has_green_zones
Mode:logical Mode:logical Mode:logical Mode:logical
NA's:21742 NA's:21742 TRUE:4074 TRUE:4057
NA's:17668 NA's:17685
energy_certificate has_parking has_private_parking has_public_parking
en trámite :10937 Mode :logical Mode:logical Mode:logical
no indicado: 3669 FALSE:14023 NA's:21742 NA's:21742
E : 2701 TRUE :7719
D : 1121
G : 894
F : 674
(Other) : 1746
is_parking_included_in_price parking_price is_orientation_north
Mode :logical Min. : 0 Mode :logical
FALSE:604 1st Qu.: 0 FALSE:8567
TRUE :7115 Median : 0 TRUE :2791
NA's :14023 Mean : 2658 NA's :10384
3rd Qu.: 0
Max. :600000
NA's :14023
is_orientation_west is_orientation_south is_orientation_east
Mode :logical Mode :logical Mode :logical
FALSE:7287 FALSE:5535 FALSE:6302
TRUE :4071 TRUE :5823 TRUE :5056
NA's :10384 NA's :10384 NA's :10384
Code
summary(data_cleaned)
id sq_mt_built n_rooms n_bathrooms
Min. : 3 Min. : 16.0 Min. : 0.000 Min. : 1.000
1st Qu.: 4786 1st Qu.: 68.0 1st Qu.: 2.000 1st Qu.: 1.000
Median :10324 Median : 90.0 Median : 3.000 Median : 2.000
Mean :10644 Mean :105.6 Mean : 2.741 Mean : 1.675
3rd Qu.:16448 3rd Qu.:129.0 3rd Qu.: 3.000 3rd Qu.: 2.000
Max. :21742 Max. :620.0 Max. :14.000 Max. :14.000
is_exact_address_hidden floor neighborhood_id rent_price
Mode :logical Min. :-1.000 Length:7854 Min. : 35
FALSE:1828 1st Qu.: 1.000 Class :character 1st Qu.: 850
TRUE :6026 Median : 2.000 Mode :character Median :1223
Mean : 2.602 Mean :1335
3rd Qu.: 4.000 3rd Qu.:1758
Max. : 9.000 Max. :2517
buy_price house_type_id is_renewal_needed built_year
Min. : 36000 Length:7854 Mode :logical Min. :1800
1st Qu.: 180000 Class :character FALSE:6203 1st Qu.:1956
Median : 315000 Mode :character TRUE :1651 Median :1970
Mean : 404489 Mean :1968
3rd Qu.: 550000 3rd Qu.:1991
Max. :1378900 Max. :2022
has_ac has_lift has_pool has_terrace
Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:3546 FALSE:1928 FALSE:6388 FALSE:4583
TRUE :4308 TRUE :5926 TRUE :1466 TRUE :3271
has_balcony has_storage_room is_accessible energy_certificate
Mode :logical Mode :logical Mode :logical Length:7854
FALSE:6565 FALSE:5386 FALSE:5679 Class :character
TRUE :1289 TRUE :2468 TRUE :2175 Mode :character
has_parking
Mode :logical
FALSE:5519
TRUE :2335
2.4 Spotting Mistakes and Missing Data
During the cleaning process, the following issues with mistakes and missing data were identified and addressed:
Missing Values: Columns with high NA counts were removed or had missing values filled with FALSE where appropriate. For example, missing data in has_ac, has_garden, and has_pool was assumed to indicate the absence of these features.
Wrong Value: We corrected a wrong value in built_year (from 8170 to 1870), as this seems to be twisted numbers.
Observations with negative values in rent_price were removed, as this did not make sense.
2.5 Listing Anomalies and Outliers
Outlier Identification: Used boxplots for visual inspection. If “>1.5xIQR”, then considered outlier
Handling Strategy: Retained legitimate outliers (e.g., high-end properties in prime locations) but removed invalid values (e.g., negative rent prices). Outliers were kept because housing properties can vary based e.g on luxuriousness, and other factors that are not included in our variables. Those high-value properties or very large properties may be unique but still real. In the Madrid housing market, this might include luxury penthouses, historic homes, or properties in prime neighborhoods. By keeping the outliers, we preserves data integrity and can also model for high end or unique properties.
3 Exploratory Data Analysis (EDA)
3.1 Initial Visualizations
3.1.1 Histograms of Rental and Buying prices
Rental: The concentration of rental prices between €800 and €1,200 suggests that there is a significant offer for affordable housing, making this the competitive range for renters. The fewer higher-priced rentals beyond €2,000 may indicate luxury or specialized properties, perhaps in prime locations or with additional amenities. This distribution reflects a standard market pattern where most rental offerings cater to budget-conscious tenants, with a small segment targeting high-income renters. There is also a noticeable spike at €2,200, which may represent a group of high-end properties. This could be due to larger properties, luxury amenities, or prime locations in desirable neighborhoods.
Buying: The buying market is accessible for a broad range of buyers, with most properties under €500,000. The long tail extending past €1,000,000 likely includes high-end or exclusive properties. This pattern indicates that while Madrid’s market has an affordable foundation, there’s also room for luxury properties, suggesting a market that accommodates both average-income buyers and high-end investors. Compared to the rent histogram, we see significantly more outliers in the high price segment. This could come from the fact that very expensive properties could be more likely to be sold instead of rented. An example would be that people usually prefer buying houses compared to renting them, as it is seen as an investment.
Code
# ADD UNIFIED COLOUR SCHEME# Rent Prices: Blue color schemerent_color ="blue"rent_color_light ="lightblue"# Buy Prices: Green color schemebuy_color ="green"buy_color_light ="lightgreen"# general colour if not specifically related to one of the 2 categoriesgeneral_color ="purple"general_color_light ="orchid"
# 1.1 Histogram of rental prices (Interactive)plotly::plot_ly(data = data_cleaned, x =~rent_price, type ="histogram",marker =list(color = rent_color)) %>%layout(title ="Distribution of Rental Prices",xaxis =list(title ="Rental Price (€)"),yaxis =list(title ="Frequency") )
Code
# 1.2 Histogram of buying prices (Interactive)plotly::plot_ly(data = data_cleaned, x =~buy_price, type ="histogram",marker =list(color = buy_color)) %>%layout(title ="Distribution of Buying Prices",xaxis =list(title ="Buying Price (€)"),yaxis =list(title ="Frequency") )
3.1.2 Energy Certificate Overview
3.1.2.1 Bar Chart
Most properties are either categorized as “in process” or have a “not indicated” energy certificate, meaning that a large portion of the dataset lacks finalized or available energy efficiency information. Among properties with specified ratings, the lower categories have the highest counts. This pattern could reflect the age of properties in the dataset, as older buildings tend to have lower energy efficiency. As Madrid is a city with many storic building that would match thet assumption.
Code
# 2.1 Bar chart for energy_certificate (interactive)plotly::plot_ly(data = data_cleaned %>%count(energy_certificate),x =~energy_certificate, y =~n, type ="bar",marker =list(color = general_color)) %>%layout(title ="Count of Energy Certificates",xaxis =list(title ="Energy Certificate"),yaxis =list(title ="Count") )
3.1.2.2 Boxplots Rental and Buying Prices by Energy Certificate
The boxplot of buying and rental prices by energy certificate shows no clear upward trend in property prices with improved energy ratings. While properties with ratings A and C appear to have slightly higher medians, there is still considerable overlap between most categories. Properties in categories like F and G still have a wide range of prices, showing that even less energy-efficient properties can have high prices, possibly due to factors like location or property size. As there is no strong price premium on energy-efficient properties, it could imply that buyers in this market are not placing a high value on energy ratings, or it may indicate that other factors, like location and size, are more influential in determining property prices.
# 2.2 Boxplot for rental prices by energy certificate (interactive)plotly::plot_ly(data = data_cleaned,x =~energy_certificate, y =~rent_price, type ="box",marker =list(color = rent_color)) %>%layout(title ="Rental Prices by Energy Certificate",xaxis =list(title ="Energy Certificate"),yaxis =list(title ="Rental Price (€)") )
Code
# 2.3 Boxplot for buying prices by energy certificate (interactive)plotly::plot_ly(data = data_cleaned,x =~energy_certificate,y =~buy_price,type ="box",boxpoints ="outliers", # Show outliersline =list(color = buy_color), # Set green color for box lines and whiskersmarker =list(color = buy_color), # Ensure outlier points are greenfillcolor = buy_color_light # Fill boxes with light green) %>%layout(title ="Buying Prices by Energy Certificate",xaxis =list(title ="Energy Certificate"),yaxis =list(title ="Buying Price (€)") )
3.1.3 House Type
3.1.3.1 Bar Chart
Flats make up the vast majority of properties in the dataset, with very few duplexes and penthouses. This distribution suggests that flats are the predominant housing type available in the market. As Madrid is an urban area, apartments are a common choice due to high population density and limited space.
Code
# 2.4 Bar chart for house_type_id (interactive)plotly::plot_ly(data = data_cleaned %>%count(house_type_id),x =~house_type_id, y =~n, type ="bar",marker =list(color = general_color)) %>%layout(title ="Count of House Types",xaxis =list(title ="House Type"),yaxis =list(title ="Count") )
3.1.3.2 Boxplots rental and buy prices by house type
Based on the median, duplexes and penthouses have higher prices, which highlights their role as premium options in the housing market. These types of properties often offer larger spaces, distinctive layouts, superior views, and greater privacy, which are all features that appeal to both buyers and renters prepared to invest more. Flats, on the other hand, cater to a wider audience, covering a range of price points from affordable to mid-range and even some high-end units. This range of pricing for flats reflects a flexible market. Flats act as the base for various buyers and renters, while duplexes and penthouses meet the specific demands of those seeking more exclusive, upscale living spaces.
# 2.5 Boxplot for rental prices by house type (interactive)plotly::plot_ly(data = data_cleaned,x =~house_type_id, y =~rent_price, type ="box",marker =list(color = rent_color)) %>%layout(title ="Rental Prices by House Type",xaxis =list(title ="House Type"),yaxis =list(title ="Rental Price (€)") )
Code
# 2.6 Boxplot for buying prices by house type (interactive)plotly::plot_ly(data = data_cleaned,x =~house_type_id, y =~buy_price, type ="box",fillcolor = buy_color_light, # Fill the boxes with light greenline =list(color = buy_color), # Set the box outlines and whiskers to greenmarker =list(color = buy_color) # Ensure outliers are green) %>%layout(title ="Buying Prices by House Type",xaxis =list(title ="House Type"),yaxis =list(title ="Buying Price (€)") )
3.1.4 Size on Price
The positive correlation between property size and buy and rental price indicates that the size of the property is a key determinant of both rental and purchase prices. Larger properties tend to be more expensive, both to rent and buy, aligning with market expectations. However, for properties above a certain size (around 200 square meters), price variability increases, reflecting the influence of additional factors like neighborhood, and other specific property features. This variability suggests a premium segment in the market for larger, potentially luxury properties, where size alone doesn’t set the price but combines with other value-adding factors. But also properties that probably are in need of renovation can be a cause for the lower priced ones. What is also evident in the scatter plot square meters on rent price is that there is a very distinct upper border for the prices. This could indicate some sort of capped rental prices in the period where the data was pulled. Another potential explanation could be that there is no demand for flats in that price segment, so owners tend to sell instead of rent the properties. Considering the max price border is that distinct, the first explanation appears more likely.
# 3.1 Scatter plot for sq_mt_built vs. rent_price (interactive)plotly::plot_ly(data = data_cleaned,x =~sq_mt_built, y =~rent_price,type ="scatter", mode ="markers",marker =list(color = rent_color, opacity =0.5) # Add transparency) %>%layout(title ="Size vs Rental Prices",xaxis =list(title ="Square Meters Built"),yaxis =list(title ="Rental Price (€)") )
Code
# 3.2 Scatter plot for sq_mt_built vs. buy_price (interactive)plotly::plot_ly(data = data_cleaned,x =~sq_mt_built, y =~buy_price,type ="scatter", mode ="markers",marker =list(color = buy_color, opacity =0.5) # Add transparency) %>%layout(title ="Size vs Buying Prices",xaxis =list(title ="Square Meters Built"),yaxis =list(title ="Buying Price (€)") )
3.1.5 Age on Price
The relationship between a property’s age and its price indicates that both buyers and renters value newer constructions. This could reflect demand for properties with modern amenities, better energy efficiency, and improved construction standards. However, the higher prices for very old buildings (from the 1800s) suggest that there is also a niche market for historic properties, likely due to their architectural charm or premium locations. For late-mid-20th-century properties, the lower prices may indicate a perception of these buildings as lacking in either historic value or modern features, making them less appealing. Those buildings are often classified as post second world war builings, where functionality and fast rebuilt was prioritized over aesthetics. These observations could be helpful for developers or investors targeting renovations or upgrades to increase appeal in older properties, especially mid-century ones that might benefit from modernization.
Code
# 4. Relationship Between Property Age and Prices# 4.1 Average rent and buy prices by built yearprice_by_built_year <- data_cleaned %>%group_by(built_year) %>%summarize(avg_rent_price =mean(rent_price, na.rm =TRUE),avg_buy_price =mean(buy_price, na.rm =TRUE) )
# 4.2 Scatter plot for average rental prices by built year (interactive)plotly::plot_ly(data = price_by_built_year,x =~built_year,y =~avg_rent_price,type ="scatter",mode ="markers",marker =list(color = rent_color, opacity =0.6, size =5) # Reduced point size) %>%add_trace(x =~price_by_built_year$built_year,y =~predict(loess(avg_rent_price ~ built_year, data = price_by_built_year)),mode ="lines",line =list(color = rent_color, width =2) # Smooth loess line with specified color and width ) %>%layout(title ="Average Rental Prices by Year Built",xaxis =list(title ="Year Built"),yaxis =list(title ="Average Rent Price (€)"),showlegend =FALSE# Hide legend for simplicity )
Code
# 4.3 Scatter plot for average buying prices by built year (interactive)plotly::plot_ly(data = price_by_built_year,x =~built_year,y =~avg_buy_price,type ="scatter",mode ="markers",marker =list(color = buy_color, opacity =0.6, size =5) # Smaller points with transparency) %>%add_trace(x =~price_by_built_year$built_year,y =~predict(loess(avg_buy_price ~ built_year, data = price_by_built_year)),mode ="lines",line =list(color = buy_color, width =2) # Loess smoothing line ) %>%layout(title ="Average Buying Prices by Year Built",xaxis =list(title ="Year Built"),yaxis =list(title ="Average Buy Price (€)"),showlegend =FALSE# Remove legend for simplicity )
3.1.6 Number of rooms and bathrooms on Price
The analysis indicates that the number of rooms and bathrooms significantly impacts both rental and buying prices. Properties with more rooms and bathrooms typically fall into the higher-end segment, appealing to families or buyers with greater purchasing power who value spaciousness and comfort. In the buying market, additional rooms and bathrooms noticeably drive up property values, as buyers appear willing to pay extra for these features. In contrast, the rental market shows a clear price ceiling, especially for larger properties, which suggests there’s a limit to what renters can afford or are willing to pay, regardless of additional space or amenities. This difference highlights that while both size and features add value, rental prices are generally more limited by affordability constraints than buying prices.
# 5.1 Boxplot for Rent Price by Number of Rooms (interactive)plotly::plot_ly(data = data_cleaned,x =~as.factor(n_rooms),y =~rent_price,type ="box",fillcolor = rent_color_light,line =list(color = rent_color),marker =list(color = rent_color)) %>%layout(title ="Rent Price by Number of Rooms",xaxis =list(title ="Number of Rooms"),yaxis =list(title ="Rent Price (€)") )
Code
# 5.2 Boxplot for Rent Price by Number of Bathrooms (interactive)plotly::plot_ly(data = data_cleaned,x =~as.factor(n_bathrooms),y =~rent_price,type ="box",fillcolor = rent_color_light,line =list(color = rent_color),marker =list(color = rent_color)) %>%layout(title ="Rent Price by Number of Bathrooms",xaxis =list(title ="Number of Bathrooms"),yaxis =list(title ="Rent Price (€)") )
Code
# 5.3 Boxplot for Buy Price by Number of Rooms (interactive)plotly::plot_ly(data = data_cleaned,x =~as.factor(n_rooms),y =~buy_price,type ="box",fillcolor = buy_color_light,line =list(color = buy_color),marker =list(color = buy_color)) %>%layout(title ="Buy Price by Number of Rooms",xaxis =list(title ="Number of Rooms"),yaxis =list(title ="Buy Price (€)") )
Code
# 5.4 Boxplot for Buy Price by Number of Bathrooms (interactive)plotly::plot_ly(data = data_cleaned,x =~as.factor(n_bathrooms),y =~buy_price,type ="box",fillcolor = buy_color_light,line =list(color = buy_color),marker =list(color = buy_color)) %>%layout(title ="Buy Price by Number of Bathrooms",xaxis =list(title ="Number of Bathrooms"),yaxis =list(title ="Buy Price (€)") )
3.1.7 Correlation Analysis
The heatmap illustrates the correlations between various numerical variables in the dataset. The strongest positive correlations are observed between square meters built and number of rooms, as well as between square meters built and buy price, suggesting that larger properties generally have more rooms and higher prices. Similarly, there is a moderate positive correlation between rent price and buy price, which indicates that properties with higher buying prices tend to also have higher rental prices. On the other hand, built year shows weak correlations with other variables, suggesting that the age of the property has a limited direct impact on its price or other features. However, this relationship may be influenced by factors not captured in the dataset, such as the property’s condition, location, or the demand for modern amenities versus historic value.
Code
# 6. Correlation Matrixcor_matrix <-cor(data_cleaned[sapply(data_cleaned, is.numeric)], use ="complete.obs")cor_melted <-melt(cor_matrix)# Heatmap of the correlation matrix (interactive)plotly::plot_ly(data = cor_melted,x =~Var1,y =~Var2,z =~value,type ="heatmap",colorscale =list(list(0, "blue"), # -1 correlation (blue)list(0.5, "white"), # 0 correlation (white)list(1, "red") # +1 correlation (red) ),zmin =-1, # Minimum value for the color scalezmax =1, # Maximum value for the color scalecolorbar =list(title ="Correlation")) %>%layout(title ="Correlation Matrix Heatmap",xaxis =list(title ="", tickangle =45), # Rotate x-axis labels for better readabilityyaxis =list(title ="") )
3.1.8 Rental and Buying Prices by Neighborhood
The highest rent prices per square meter are concentrated in central Madrid, with darker shades indicating premium areas. This central cluster is likely due to the demand for proximity to amenities, Schools, business centers, and cultural attractions. Beyond the central area, northern neighborhoods also show relatively high rent prices, suggesting they are seen as desirable places to live, possibly due to quality infrastructure, green spaces, and high-end residential areas. In contrast, the southern neighborhoods generally show lower rent prices, implying that they may be more affordable or less in demand compared to the north and center. Similar to rental prices, buying prices per square meter are highest in central and northern neighborhoods, with a clear gradient from high prices in the north to lower prices in the south. The northern part of Madrid likely attracts higher buying prices due to factors such as newer developments, higher-end residential zones, and perhaps perceived prestige.
Code
# 7. Relationship between Neighborhood and Price# 7.1 Create a new dataset and calculate average rent and buy price per square meter by neighborhoodneighborhood_prices <- data_cleaned %>%mutate(rent_price_per_sqm = rent_price / sq_mt_built,buy_price_per_sqm = buy_price / sq_mt_built ) %>%group_by(neighborhood_id) %>%summarise(avg_rent_price_per_sqm =mean(rent_price_per_sqm, na.rm =TRUE),avg_buy_price_per_sqm =mean(buy_price_per_sqm, na.rm =TRUE) )# 7.2 Download Shapefile of Madrid# Adjust the path to include the Barrios folder in your Downloads directorymadrid_shapefile <-st_read("data/raw/Distritos/Distritos.shp", quiet =TRUE)# 7.3 Update names in neighborhood_prices to match shapefile if neededneighborhood_prices <- neighborhood_prices %>%mutate(neighborhood_id =case_when( neighborhood_id =="Moncloa"~"Moncloa - Aravaca", neighborhood_id =="Fuencarral"~"Fuencarral - El Pardo",TRUE~ neighborhood_id ))# 7.4 Merge neighborhood_prices with madrid_shapefile based on neighborhood namemerged_neighborhood_prices <- madrid_shapefile %>%left_join(neighborhood_prices, by =c("NOMBRE"="neighborhood_id"))# Calculate area of each neighborhood polygon for scaling text size and add line breaks after the first blank space in district namesmerged_neighborhood_prices <- merged_neighborhood_prices %>%mutate(area =as.numeric(st_area(geometry)), # Ensure 'area' column for font scalingNOMBRE_wrapped =sub(" ", "\n", NOMBRE)) # Insert line break after the first space
# 7.5 Heatmap for Average Rent Price per Square Meter by Neighborhood with Black Border and District Namesggplot(merged_neighborhood_prices) +geom_sf(aes(fill = avg_rent_price_per_sqm), color ="black", size =0.1) +geom_text(data = merged_neighborhood_prices,aes(label = NOMBRE_wrapped, geometry = geometry,size =pmin(area /max(area) *3+1.5, 3)), # Scale font size with area, max 3stat ="sf_coordinates", # Center label on polygoncolor ="black") +scale_fill_gradient(low ="lightyellow", high ="darkred", name ="Avg Rent Price\nper Sq Meter") +scale_size_continuous(range =c(1.5, 3), guide ="none") +# Control size limitslabs(title ="Average Rent Price per Square Meter by Neighborhood in Madrid") +theme_minimal()
Code
# 7.6 Heatmap for Average Buy Price per Square Meter by Neighborhood with Black Border and District Namesggplot(merged_neighborhood_prices) +geom_sf(aes(fill = avg_buy_price_per_sqm), color ="black", size =0.1) +geom_text(data = merged_neighborhood_prices,aes(label = NOMBRE_wrapped, geometry = geometry,size =pmin(area /max(area) *3+1.5, 3)), # Scale font size with area, max 3stat ="sf_coordinates", # Center label on polygoncolor ="black") +scale_fill_gradient(low ="lightyellow", high ="darkred", name ="Avg Buy Price\nper Sq Meter") +scale_size_continuous(range =c(1.5, 3), guide ="none") +# Control size limitslabs(title ="Average Buy Price per Square Meter by Neighborhood in Madrid") +theme_minimal()
3.1.9 Distribution of Property Types Across Neighborhoods
The graph highlights the prevalence of flats across all neighborhoods in Madrid, reflecting their dominant role in meeting the city’s housing needs. Duplexes, while less common, are notable in neighborhoods such as Hortaleza, Moncloa , and Tetuán. This pattern could be linked to varying development trends, with duplexes potentially offering a compromise between space and urban accessibility. Their presence in both central and peripheral areas suggests they cater to diverse buyer preferences, from families to professionals seeking larger living spaces within the city. Penthouses, by contrast, are a niche option and show slightly higher proportions in Barajas and Moncloa. These areas, often less dense and quieter than central neighborhoods, may attract buyers interested in more exclusive housing options with added privacy and better views. However, the relatively low overall proportion of penthouses reflects their luxury positioning and limited availability in the market.
Code
# 8. Distribution of Property Types Across Neighborhoods (interactive)plotly::plot_ly(data = data_cleaned %>%count(neighborhood_id, house_type_id) %>%group_by(neighborhood_id) %>%mutate(proportion = n /sum(n)),x =~neighborhood_id,y =~proportion,color =~house_type_id,type ="bar") %>%layout(barmode ="stack",title ="Proportion of Property Types Across Neighborhoods",xaxis =list(title ="Neighborhood", tickangle =-45),yaxis =list(title ="Proportion"),legend =list(title =list(text ="House Type")) )
3.1.10 Energy Efficiency vs. Age of Property
This graph highlights how energy efficiency ratings vary by property age, reflecting changes in construction practices and regulations over time. Older properties, especially those built before 1950, are predominantly rated F or G, showing low energy efficiency due to outdated building techniques. In contrast, buildings constructed after 2000 exhibit higher energy efficiency, with more frequent ratings of A and B, likely driven by modern regulations and advancements in sustainable construction. The significant presence of “not indicated” and “in process” ratings for newer properties suggests gaps in certification or ongoing assessments.
Code
# 9. Relation Between Energy Efficiency and Age of Propertyggplot(data_cleaned, aes(x = built_year, y = energy_certificate, fill = energy_certificate)) +geom_density_ridges(alpha =0.7) +labs(title ="Energy Efficiency Rating by Property Age",x ="Year Built",y ="Energy Certificate",fill ="Energy Certificate") +theme_minimal()
3.1.11 Amenities vs. Prices
The graphs highlight the significant role of amenities in influencing property prices, with lifts showing the largest impact. This reflects their necessity in urban living, especially in multi-story buildings, where accessibility is crucial for families, the elderly, or those with mobility challenges. Properties without lifts, particularly in buildings constructed before lifts became standard, are less appealing, reducing their market value. Parking also drives prices, as secure parking is essential in areas with limited street options, catering to practical needs. Pools and storage rooms add value by addressing lifestyle preferences, appealing to those seeking comfort or extra utility. Terraces, while desirable for outdoor access, show a smaller impact, likely because their value depends on the property’s context, being more appealing in dense urban areas where outdoor space is limited. These trends emphasize how practical amenities like lifts and parking significantly influence demand, while luxury features cater to specific buyer or tenant preferences.
Code
# 10. Amenities vs. Prices# Pivot data for all amenitiesdata_cleaned_long <- data_cleaned %>%select(rent_price, buy_price, starts_with("has_")) %>%pivot_longer(cols =starts_with("has_"), names_to ="amenity", values_to ="presence")
# Boxplot for rent priceggplot(data_cleaned_long, aes(x =as.factor(presence), y = rent_price, fill =as.factor(presence))) +geom_boxplot(alpha =0.7) +facet_wrap(~ amenity, scales ="free_y") +labs(title ="Impact of Amenities on Rent Price",x ="Amenity Presence",y ="Rent Price (€)",fill ="Presence") +theme_minimal()
Code
# Boxplot for buy priceggplot(data_cleaned_long, aes(x =as.factor(presence), y = buy_price, fill =as.factor(presence))) +geom_boxplot(alpha =0.7) +facet_wrap(~ amenity, scales ="free_y") +labs(title ="Impact of Amenities on Buy Price",x ="Amenity Presence",y ="Buy Price (€)",fill ="Presence") +theme_minimal()
3.1.12 Amenity Across Property Types
The graph shows how amenities are distributed across property types, reflecting their different market positioning. Penthouses and duplexes typically offer more amenities, such as terraces and parking, appealing to buyers seeking space and luxury. Flats prioritize essential features like lifts, which are crucial for accessibility in urban areas, but have fewer lifestyle-focused amenities like pools or storage rooms, likely due to space and affordability constraints. Penthouses, associated with exclusivity, frequently include terraces and parking, catering to high-end preferences.
The summary statistics of the cleaned dataset reveal a diverse range of properties in Madrid. The mean property size is 106.4 m², but the median is 68.6 m², indicating a predominance of smaller properties. Most properties have around 2 rooms, and rental prices have a median of €1,223, while buying prices are higher, with a median of €315,000. The highest buying price exceeds €1.3 million, highlighting luxury properties in the market.
This diversity has already emerged when visualizing the data: Flats are the most common property type, followed by penthouses and duplexes, with the Centro neighborhood having the highest number of listings. Most properties have an energy certificate marked as “in process” or “not indicated,” and a notable portion in the “E” energy efficiency category. A significant share of properties offers amenities like air conditioning, terraces, and balconies, though many lack parking or accessibility.
With a mean built year of 1968, many properties are older and may require renewal. This reflects the presence of older buildings in the market. Overall, the dataset includes a wide variety of property types, sizes, and conditions, reflecting the diverse housing market in Madrid.
Code
# 7. Summary statistics of the cleaned datasummary(data_cleaned)
id sq_mt_built n_rooms n_bathrooms
Min. : 3 Min. : 16.0 Min. : 0.000 Min. : 1.000
1st Qu.: 4786 1st Qu.: 68.0 1st Qu.: 2.000 1st Qu.: 1.000
Median :10324 Median : 90.0 Median : 3.000 Median : 2.000
Mean :10644 Mean :105.6 Mean : 2.741 Mean : 1.675
3rd Qu.:16448 3rd Qu.:129.0 3rd Qu.: 3.000 3rd Qu.: 2.000
Max. :21742 Max. :620.0 Max. :14.000 Max. :14.000
is_exact_address_hidden floor neighborhood_id rent_price
Mode :logical Min. :-1.000 Length:7854 Min. : 35
FALSE:1828 1st Qu.: 1.000 Class :character 1st Qu.: 850
TRUE :6026 Median : 2.000 Mode :character Median :1223
Mean : 2.602 Mean :1335
3rd Qu.: 4.000 3rd Qu.:1758
Max. : 9.000 Max. :2517
buy_price house_type_id is_renewal_needed built_year
Min. : 36000 Length:7854 Mode :logical Min. :1800
1st Qu.: 180000 Class :character FALSE:6203 1st Qu.:1956
Median : 315000 Mode :character TRUE :1651 Median :1970
Mean : 404489 Mean :1968
3rd Qu.: 550000 3rd Qu.:1991
Max. :1378900 Max. :2022
has_ac has_lift has_pool has_terrace
Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:3546 FALSE:1928 FALSE:6388 FALSE:4583
TRUE :4308 TRUE :5926 TRUE :1466 TRUE :3271
has_balcony has_storage_room is_accessible energy_certificate
Mode :logical Mode :logical Mode :logical Length:7854
FALSE:6565 FALSE:5386 FALSE:5679 Class :character
TRUE :1289 TRUE :2468 TRUE :2175 Mode :character
has_parking
Mode :logical
FALSE:5519
TRUE :2335
3.3 Key Findings or Patterns
Property Size: Both rental and buying prices generally increase with property size, as expected, with larger properties catering to a premium market. For properties exceeding 200 square meters, price variability increases, likely influenced by factors like location or specific property features.
Rooms and Bathrooms: The number of rooms and bathrooms is a strong determinant of both rental and buying prices, particularly in the buying market, where these features appeal to families and high-end buyers. In the rental market, prices for larger properties appear to hit a ceiling (€2,500), suggesting affordability constraints.
Energy Efficiency: Energy certificates show limited correlation with property prices, indicating that energy efficiency may not yet be a major pricing factor in Madrid. Older properties, especially pre-1950, often have lower efficiency ratings (F or G), while newer properties built after 2000 tend to have higher ratings (A or B), reflecting modern construction standards. Many properties still lack finalized energy certifications, particularly newer ones marked as “in process.”
Neighborhood Influence: Central and northern neighborhoods command the highest average buy and rent prices, likely due to proximity to amenities, prestige, and infrastructure quality. Southern neighborhoods are more affordable, indicating less demand or lower economic development in these areas.
Property Types Across Neighborhoods: Flats dominate the housing market across all neighborhoods, offering a practical and affordable option for urban living. Duplexes and penthouses are less common but are concentrated in neighborhoods like Hortaleza and Moncloa, reflecting a localized demand for more spacious or luxury properties.
Amenities and Prices: Practical amenities like lifts and parking have a significant impact on property prices, particularly in dense urban areas where they are highly valued. Luxury features like pools and terraces also increase prices but cater to a smaller segment of buyers or renters with specific preferences. The distinct price premium associated with lifts underscores their necessity in multi-story buildings.
Amenities by Property Type: Flats focus on essential features like lifts, while penthouses and duplexes are more likely to include high-end amenities such as terraces and parking, aligning with their luxury positioning. This distribution highlights a flexible housing market that caters to both practical and premium demands.
Age and Price: Both very old properties (e.g., pre-1900) and newer constructions tend to command higher prices. Older properties likely appeal to buyers for their historic charm and prime locations, while newer ones attract demand for modern features and energy efficiency. Properties from the mid-20th century tend to be less valued, possibly due to their age and lack of modernization, presenting opportunities for redevelopment or upgrades.
This analysis reflects the complexity and diversity of Madrid’s housing market, shaped by a combination of structural features, location, and buyer preferences. These insights provide a solid foundation for further predictive modeling and targeted recommendations.
4 Analysis
To evaluate the predictive performance of different models for property price data, a structured analysis was conducted. Property buy and rent prices in the dataset were log-transformed (buy_price_log and rent_price_log) to address skewness and improve model performance. Three models were implemented: Lasso Regression, Linear Regression, and Random Forest.
4.1 Data Preparation and Splitting
The dataset was prepared by transforming key variables and incorporating additional features, such as clustering results from Gaussian Mixture Models (GMM). GMM was applied to the log-transformed rental prices (rent_price_log) to classify properties into two categories: “Inexpensive” and “Expensive.” These categories were used as an additional predictor for rent price models. The dataset was then split into training and testing sets using a 90-10 split. The training set (90% of the data) was used to train the models, while the testing set (10%) was reserved for evaluation. This approach ensured robust performance metrics and minimized overfitting.
Three models (Lasso Regression, Linear Regression, and Random Forest) were trained to predict buy_price_log. Their performance was evaluated using the Root Mean Squared Error (RMSE) and R-squared (R²) values. The results are as follows:
Model
RMSE
R²
Random Forest
0.1783596
0.9318762
Linear Regression
0.2244886
0.8946683
Lasso Regression
0.2244320
N/A
The Random Forest model achieved the lowest RMSE, demonstrating its ability to capture complex, non-linear relationships in the data. Linear Regression and Lasso Regression performed similarly but less effectively than Random Forest.
The models were also evaluated for predicting rent_price_log. The GMM clustering results (“Inexpensive” and “Expensive”) were included as a feature in these models. The performance metrics were:
Model
RMSE
R²
Random Forest
0.1300083
0.9160775
Linear Regression
0.1783225
0.7696447
Lasso Regression
0.1782763
N/A
Random Forest significantly outperformed the other models, with the lowest RMSE and the highest R². The inclusion of GMM clustering likely enhanced the model’s ability to distinguish patterns, further improving its accuracy.
Code
# 1. Create two separate datasets for rent and buy predictions# Dataset for Rent Predictiondata_cleaned_analysis_rent <- data_cleaned %>%mutate(is_exact_address_hidden =as.factor(is_exact_address_hidden),neighborhood_id =as.factor(neighborhood_id),house_type_id =as.factor(house_type_id),is_renewal_needed =as.factor(is_renewal_needed),energy_certificate =as.factor(energy_certificate),gmm_cluster =as.factor(gmm_cluster), # GMM cluster as factorprice_category =as.factor(price_category) # Price category as factor ) %>%select(-buy_price, -rent_price, -id, -buy_price_log) # Exclude buy_price_log# Dataset for Buy Predictiondata_cleaned_analysis_buy <- data_cleaned %>%mutate(is_exact_address_hidden =as.factor(is_exact_address_hidden),neighborhood_id =as.factor(neighborhood_id),house_type_id =as.factor(house_type_id),is_renewal_needed =as.factor(is_renewal_needed),energy_certificate =as.factor(energy_certificate),gmm_cluster =as.factor(gmm_cluster), # GMM cluster as factorprice_category =as.factor(price_category) # Price category as factor ) %>%select(-buy_price, -rent_price, -id, -rent_price_log) # Exclude rent_price_log# 2. Splitting the data into training and test datasetsset.seed(123) # Set seed for reproducibility# Split data for Rent Predictiontraining_data_rent <- data_cleaned_analysis_rent %>%sample_frac(0.9)test_data_rent <-anti_join(data_cleaned_analysis_rent, training_data_rent)# Split data for Buy Predictiontraining_data_buy <- data_cleaned_analysis_buy %>%sample_frac(0.9)test_data_buy <-anti_join(data_cleaned_analysis_buy, training_data_buy)# 3. Lasso Model# Adjusted for rent and buy separatelyevaluate_lasso <-function(training_data, test_data, target_var, include_gmm =FALSE) { predictors <-if (include_gmm) "."elsecolnames(training_data)[!colnames(training_data) %in%c("gmm_cluster", target_var)] x_train <-model.matrix(reformulate(predictors, target_var), data = training_data)[, -1] # Remove intercept y_train <- training_data[[target_var]] x_test <-model.matrix(reformulate(predictors, target_var), data = test_data)[, -1] y_test <- test_data[[target_var]] cv_lasso <-cv.glmnet(x_train, y_train, alpha =1) best_lambda <- cv_lasso$lambda.min lasso_model <-glmnet(x_train, y_train, alpha =1, lambda = best_lambda) predictions <-predict(lasso_model, s = best_lambda, newx = x_test) rmse <-sqrt(mean((predictions - y_test)^2))return(list(rmse = rmse, model = lasso_model))}# Rent Prediction - Lassolasso_rent_results <-evaluate_lasso(training_data_rent, test_data_rent, "rent_price_log", include_gmm =TRUE)rmse_rent_lasso <- lasso_rent_results$rmselasso_rent_model <- lasso_rent_results$model# Buy Prediction - Lassolasso_buy_results <-evaluate_lasso(training_data_buy, test_data_buy, "buy_price_log", include_gmm =FALSE)rmse_buy_lasso <- lasso_buy_results$rmselasso_buy_model <- lasso_buy_results$model# 4. Linear Regression Model# Adjusted for rent and buy separatelyevaluate_lm <-function(training_data, test_data, target_var, include_gmm =FALSE) { predictors <-if (include_gmm) "."elsecolnames(training_data)[!colnames(training_data) %in%c("gmm_cluster", target_var)] lm_model <-lm(reformulate(predictors, target_var), data = training_data) predictions <-predict(lm_model, newdata = test_data) rmse <-sqrt(mean((predictions - test_data[[target_var]])^2)) r_squared <-summary(lm_model)$r.squaredreturn(list(rmse = rmse, r_squared = r_squared, model = lm_model))}# Rent Prediction - Linear Regressionlm_rent_results <-evaluate_lm(training_data_rent, test_data_rent, "rent_price_log", include_gmm =TRUE)rmse_lm_rent <- lm_rent_results$rmselm_rent_model <- lm_rent_results$model# Buy Prediction - Linear Regressionlm_buy_results <-evaluate_lm(training_data_buy, test_data_buy, "buy_price_log", include_gmm =FALSE)rmse_lm_buy <- lm_buy_results$rmselm_buy_model <- lm_buy_results$model# 5. Random Forest Model# Adjusted for rent and buy separatelyevaluate_rf <-function(training_data, test_data, target_var, include_gmm =FALSE) { predictors <-if (include_gmm) "."elsecolnames(training_data)[!colnames(training_data) %in%c("gmm_cluster", target_var)] rf_model <-randomForest(reformulate(predictors, target_var), data = training_data, importance =TRUE, ntree =500) predictions <-predict(rf_model, newdata = test_data) rmse <-sqrt(mean((predictions - test_data[[target_var]])^2)) r_squared <-1-sum((test_data[[target_var]] - predictions)^2) /sum((test_data[[target_var]] -mean(test_data[[target_var]]))^2)return(list(rmse = rmse, r_squared = r_squared, model = rf_model))}# Rent Prediction - Random Forestrf_rent_results <-evaluate_rf(training_data_rent, test_data_rent, "rent_price_log", include_gmm =TRUE)rmse_rf_rent <- rf_rent_results$rmserf_rent_model <- rf_rent_results$model# Buy Prediction - Random Forestrf_buy_results <-evaluate_rf(training_data_buy, test_data_buy, "buy_price_log", include_gmm =FALSE)rmse_rf_buy <- rf_buy_results$rmserf_buy_model <- rf_buy_results$model# 6. Compare Model Performancemodel_comparison <-data.frame(Model =c("Lasso - Buy Price Log", "Linear Regression - Buy Price Log", "Random Forest - Buy Price Log","Lasso - Rent Price Log", "Linear Regression - Rent Price Log", "Random Forest - Rent Price Log"),RMSE =c(rmse_buy_lasso, rmse_lm_buy, rmse_rf_buy, rmse_rent_lasso, rmse_lm_rent, rmse_rf_rent),R_Squared =c(NA, lm_buy_results$r_squared, rf_buy_results$r_squared,NA, lm_rent_results$r_squared, rf_rent_results$r_squared))# Split data for buy and rent price modelsmodel_comparison_buy <- model_comparison %>%filter(grepl("Buy Price Log", Model))model_comparison_rent <- model_comparison %>%filter(grepl("Rent Price Log", Model))# 7. Visualize Model Performance# Define custom colorscustom_colors <-c("Rent Price Log"= rent_color, "Buy Price Log"= buy_color)
4.3 Insights from Model Comparison
The Random Forest model consistently demonstrated superior performance across both buy and rent price predictions. Its high R² values (above 0.9) indicate excellent explanatory power, while the low RMSE values suggest strong predictive accuracy. Linear Regression and Lasso Regression were less effective, with comparable RMSEs and lower R² values.
The inclusion of GMM clustering for rent price prediction was particularly beneficial, as it helped the models capture variations in price categories. This feature, combined with the Random Forest’s ensemble approach, resulted in the best overall performance.
4.3.1 Visual Representation of Results
The performance of the models is visually summarized in bar plots, comparing RMSE values for both buy_price_log and rent_price_log. The plots clearly show the superiority of Random Forest in both scenarios. Notably, the RMSE for rent price predictions is significantly lower, underscoring the benefit of additional features like GMM clustering.
The neighborhood of Salamanca exhibits the strongest positive impact on buy prices (0.2842), followed by Chamartín (0.2094) and Chamberí (0.2439).
Among property features, the presence of a lift (0.2196), a pool (0.1255), and parking (0.1012) significantly elevate buy prices.
Negative Influence:
The neighborhood of Villaverde shows the most substantial negative impact on buy prices (-0.7046), followed by Villa de Vallecas (-0.5698) and Usera (-0.5588).
The category “Inexpensive” within price clustering (-0.2212) and properties in need of renewal (-0.0569) also negatively influence buy prices.
Positive Influence:
Chamberí (0.9074) is the strongest positively correlated neighborhood for rent prices, with significant contributions also from Centro (0.1402) and Salamanca (0.1258).
Property features like the presence of a lift (0.1919), a pool (0.0917), and room count (0.0456) positively influence rent prices.
Negative Influence:
Villaverde (-0.4689) and Puente de Vallecas (-0.4235) are the neighborhoods with the most negative effect on rent prices.
Properties in the neighborhood of Usera (-0.3497) and with hidden exact addresses (-0.0178) also detract slightly from rental value.
4.4.4 Comparative Insights into Coefficients and Feature Importance
4.4.4.1 Consistent Patterns
Across all models, neighborhoods like Salamanca, Chamberí, and Centro positively influence both buy and rent prices, though their relative impacts differ. Conversely, Villaverde consistently exhibits the most significant negative influence in all cases.
Features such as lift presence and parking availability emerge as significant drivers for both price types, underscoring the importance of these amenities.
4.4.4.2 Divergent Patterns
Certain neighborhoods, while positive in both models, exhibit higher impacts on one price type. For instance, Chamberí strongly influences rent prices, whereas Salamanca leads for buy prices.
Negative influences from specific neighborhoods show greater magnitude for buy prices than rent prices, with differences in the specific neighborhoods emphasized in each model.
4.4.4.3 Interpretation Random Forest
The Random Forest analysis supports and extends the coefficient findings by emphasizing the interaction effects of neighborhood and property features. These results highlight the dominant role of location and critical property characteristics like size, lift presence, and parking in predicting both buy and rent prices.
5 Conclusion
5.1 Summary
We conducted an extensive analysis of the Madrid housing dataset, including data cleaning and an examination of key factors such as property type, size, number of rooms, energy certificate, year of construction, and location. Visualizations revealed trends in property size, room count, neighborhood influence, and amenities like lifts and parking, all of which significantly impact prices. The core insights show that location has the highest impact on both buying and rental prices, followed by the property having amenities such as a lift and a pool. Other features such as for example the size and type of housing also have a significant influence on the price. This answers our research questions that not only the primary more obvious factors like the size, but also the less intuitive aspects have a significant impact that cannot be ignored. Specifically looking at the trends of location, the analysis shows us that central and northern districts positively influence the prices, while southern areas decrease prices. Predictive modeling identified Random Forest as the most effective tool for capturing price determinants and forecasting trends. These findings provide valuable insights into Madrid’s housing market and support data-driven decision-making for buyers, renters, and investors.
5.2 Final Thoughts
A key challenge in this project has been managing missing values and potential biases within the dataset. For certain variables, particularly boolean ones, we made assumptions to handle missing values. Additionally, the dataset’s static nature, representing a single snapshot in time, limits our ability to capture seasonal trends and evolving market dynamics in Madrid.
While we cannot expand beyond the dataset’s scope, we can highlight specific limitations, such as the lack of time-series data or additional contextual information, that affect our ability to fully capture market trends. These points will help guide recommendations for enhancing similar analyses in the future and achieving a more complete view of Madrid’s housing market.
Baldominos, A., Blanco, I., Moreno, A. J., Iturrarte, R., Bernárdez, Ó., & Afonso, C. (2018). Identifying Real Estate Opportunities using Machine Learning. arXiv preprint arXiv:1809.04933.https://arxiv.org/abs/1809.04933
Jafari, A., & Akhavian, R. (2019). Driving forces for the US residential housing price: a predictive analysis. Built Environment Project and Asset Management, 9(4), 515–529. https://doi.org/10.1108/bepam-07-2018-0100
Kenyon, G. E., Arribas-Bel, D., & Robinson, C. (2024). Extracting Features from Satellite Imagery to Understand the Size and Scale of Housing Sub-Markets in Madrid. Land, 13(5), 575. https://www.mdpi.com/2073-445X/13/5/575
Source Code
---title: Housing Market Analysis in Madridauthor: - Buchner, Julia - Reichle, Franziska - Sima, Stefanoinstitute: University of Lausannedate: 12/23/2024title-block-banner: "#0095C8" # chosen for the university of lausannetoc: truetoc-location: rightformat: html: number-sections: true html-math-method: katex self-contained: true code-overflow: wrap code-fold: true code-tools: true include-in-header: # add custom css to make the text in the `</> Code` dropdown black text: | <style type="text/css"> .quarto-title-banner a { color: #000000; } </style># pdf: # use this if you want to render pdfs instead# include-in-header: # wrapping the code also in the pdf (otherwise, it overflows)# text: |# \usepackage{fvextra}# \DefineVerbatimEnvironment{Highlighting}{Verbatim}{# commandchars=\\\{\},# breaklines, breaknonspaceingroup, breakanywhere# }abstract: | This study analyzes the primary factors influencing rental and buying property prices in Madrid's housing market, leveraging a dataset from Kaggle. The analysis involved extensive data cleaning to address missing values, standardize variables, and handle outliers, ensuring robust results. Exploratory data analysis revealed trends based on property type, size, number of rooms, energy certification, year of construction, location, and amenities like lifts and parking. Larger properties in central and northern neighborhoods command higher prices, while flats dominate the market due to their affordability and versatility. Luxury properties, such as penthouses, reflect a niche demand for exclusivity. \n Predictive modeling was conducted using Lasso Regression, Linear Regression, and Random Forest to forecast log-transformed rental and buying prices. Gaussian Mixture Models (GMM) were applied to classify rental prices into "Inexpensive" and "Expensive" categories, improving the prediction accuracy. Random Forest emerged as the best-performing model, achieving the lowest RMSE (0.1783596 for buying prices, 0.1300083 for rental prices) and the highest R² (0.9318762 and 0.9160775, respectively), demonstrating its ability to capture complex relationships in the data. Limitations included the dataset's static nature and assumptions made for handling missing values. The results provide valuable insights for stakeholders, including investors and policymakers, and highlight the importance of location, property size, and amenities in determining housing prices. These findings establish a foundation for further research and predictive modeling in urban real estate markets.---```{r}#| label: setup#| echo: false#| include: false#| message: false# loading all the necessary packagessource(here::here("src", "setup.R"))```{{< include sections/_introduction.qmd >}}{{< include sections/_data.qmd >}}{{< include sections/_eda.qmd >}}{{< include sections/_analysis.qmd >}}{{< include sections/_conclusion.qmd >}}